nlp_architect.utils package


nlp_architect.utils.ansi2html module

nlp_architect.utils.ansi2html.ansi2html(text, palette='solarized')[source]

, out)[source]

nlp_architect.utils.embedding module

class nlp_architect.utils.embedding.ELMoEmbedderTFHUB[source]

Bases: object

class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)[source]

Bases: object

Fasttext embedding trainer class

Parameters:
  • texts (List[List[str]]) – list of tokenized sentences
  • size (int) – embedding size
  • epochs (int, optional) – number of epochs to train
  • window (int, optional) – The maximum distance between the current and predicted word within a sentence
classmethod load(path)[source]

load model from path

save(path) → None[source]

save model to path

train(texts: List[List[str]], epochs: int = 100)[source]
vec(word: str) → numpy.ndarray[source]

Return the vector corresponding to the given word

nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)[source]

Creates a new matrix from given matrix of int words using the embedding model provided.

Parameters:
  • src_mat (numpy.ndarray) – source matrix
  • src_lex (dict) – source matrix lexicon
  • emb_lex (dict) – embedding lexicon
  • emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None) → numpy.ndarray[source]

Generate a matrix of word embeddings given a vocabulary

Parameters:
  • embeddings (dict) – a dictionary of embedding vectors
  • vocab (Vocabulary) – a Vocabulary
  • embedding_size (int) – custom embedding matrix size

Returns: a 2D numpy matrix of lexicon embeddings
Return type: numpy.ndarray

nlp_architect.utils.embedding.load_embedding_file(filename: str) → dict[source]

Load a word embedding file

Parameters: filename (str) – path to embedding file
Returns: dictionary with embedding vectors
Return type: dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)[source]

Loads a word embedding model text file into a word(str) to numpy vector dictionary

Parameters:
  • file_path (str) – path to model file
  • vocab (list of str, optional) – vocabulary

Returns: a dictionary of numpy.ndarray vectors, and the detected word embedding vector size (int)
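The loading step described above can be sketched as follows. This is a minimal illustration, assuming a GloVe-style text format in which each line holds a word followed by its space-separated vector components. The helper name `parse_embedding_lines` is hypothetical, plain lists of floats stand in for numpy.ndarray vectors, and the library's actual file handling may differ.

```python
import io
from typing import Dict, Iterable, List, Optional, Tuple


def parse_embedding_lines(
    lines: Iterable[str], vocab: Optional[Iterable[str]] = None
) -> Tuple[Dict[str, List[float]], int]:
    """Parse 'word v1 v2 ...' lines into a word -> vector dict (hypothetical helper)."""
    wanted = set(vocab) if vocab is not None else None
    vectors: Dict[str, List[float]] = {}
    size = 0
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if wanted is not None and word not in wanted:
            continue  # keep only words from the optional vocabulary
        vectors[word] = [float(v) for v in values]
        size = len(values)  # detected embedding vector size
    return vectors, size


# An in-memory file stands in for the embedding file on disk.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
embeddings, emb_size = parse_embedding_lines(sample, vocab=["dog"])
print(embeddings, emb_size)
```

Passing a vocabulary keeps memory use proportional to the words a model actually needs rather than to the full embedding file.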


nlp_architect.utils.file_cache module

Utilities for working with the local dataset cache.

nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str[source]

Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
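The URL-or-path decision described above can be sketched with the standard library. This is only an illustration: the helper names `classify_location` and `resolve_local` are hypothetical, treating only http and https schemes as URLs is an assumption, and the download and caching step is left out.

```python
import os
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def classify_location(url_or_filename: Union[str, Path]) -> str:
    """Return 'url' for http/https inputs, otherwise 'path' (hypothetical helper)."""
    scheme = urlparse(str(url_or_filename)).scheme
    if scheme in ("http", "https"):
        return "url"
    return "path"


def resolve_local(url_or_filename: Union[str, Path]) -> str:
    """Return the path unchanged if it exists; raise FileNotFoundError otherwise."""
    text = str(url_or_filename)
    if not os.path.exists(text):
        raise FileNotFoundError(f"file {text} not found")
    return text


print(classify_location("https://example.com/data.zip"))
print(classify_location(Path("data") / "train.txt"))
```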

nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str][source]

Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.

nlp_architect.utils.file_cache.get_from_cache(url: str, cache_dir: str = None) → str[source]

Given a URL, look for the corresponding dataset in the local cache. If it’s not there, download it. Then return the path to the cached file.

nlp_architect.utils.file_cache.http_get(url: str, temp_file: IO) → None[source]
nlp_architect.utils.file_cache.url_to_filename(url: str, etag: str = None) → str[source]

Convert url into a hashed filename in a repeatable way. If etag is specified, append its hash to the url’s, delimited by a period.
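A minimal sketch of the hashing scheme described above. The choice of SHA-256 and UTF-8 encoding is an assumption, and `hashed_filename` is a hypothetical stand-in rather than the library function itself.

```python
from hashlib import sha256
from typing import Optional


def hashed_filename(url: str, etag: Optional[str] = None) -> str:
    """Hash url (and optionally etag) into a repeatable filename (sketch)."""
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        # Append the etag's hash to the url's hash, delimited by a period.
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename


plain = hashed_filename("https://example.com/model.bin")
tagged = hashed_filename("https://example.com/model.bin", etag="v2")
print(plain)
print(tagged)
```

Because the etag hash is part of the name, a changed remote file maps to a new cache entry instead of overwriting the old one.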

nlp_architect.utils.generic module

nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray[source]

Add offset (1 by default) to all values in matrix mat

Parameters:
  • mat (numpy.ndarray) – A 2D matrix with int values
  • offset (int) – offset to add

Returns: input matrix
Return type: numpy.ndarray


nlp_architect.utils.generic.license_prompt(model_name, model_website, dataset_dir=None)[source]
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)[source]
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 1D matrix of ints into one-hot encoded vectors.

Parameters:
  • mat (numpy.ndarray) – A 1D matrix of labels (int)
  • num_classes (int) – Number of all possible classes

Returns: A 2D matrix
Return type: numpy.ndarray
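The one-hot conversion above can be illustrated without any dependencies. In this sketch plain Python lists stand in for numpy arrays, and `one_hot_rows` is a hypothetical name; it shows the encoding idea only, not the library's implementation.

```python
from typing import List


def one_hot_rows(labels: List[int], num_classes: int) -> List[List[int]]:
    """Turn a 1D list of int labels into a 2D list of one-hot rows (sketch)."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0] * num_classes
        row[label] = 1  # a single 1 at the label's index
        rows.append(row)
    return rows


encoded = one_hot_rows([0, 2, 1], num_classes=3)
print(encoded)
```

Applying the same function to every row of a 2D label matrix gives the 3D result described for one_hot_sentence.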


nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 2D matrix of ints into one-hot encoded 3D matrix

Parameters:
  • mat (numpy.ndarray) – A 2D matrix of labels (int)
  • num_classes (int) – Number of all possible classes

Returns: A 3D matrix
Return type: numpy.ndarray


nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray[source]

Pad input sequences up to max_length. With padding_style ‘pre’ the padding values are added as a prefix, so the original values are aligned to the right; with ‘post’ they are added as a postfix.

Parameters:
  • sequences (iter) – a 2D matrix (np.array) to pad
  • max_length (int, optional) – max length of resulting sequences
  • padding_value (int, optional) – padding value
  • padding_style (str, optional) – add padding values as prefix (use with ‘pre’) or postfix (use with ‘post’)

Returns: input sequences padded to size ‘max_length’
Return type: numpy.ndarray
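The two padding styles can be sketched with plain lists. The helper `pad_rows` is hypothetical. Defaulting max_length to the longest sequence, and keeping the first max_length items of longer sequences, are both assumptions made for this sketch; the library may handle these cases differently.

```python
from typing import List, Optional, Sequence


def pad_rows(
    sequences: Sequence[Sequence[int]],
    max_length: Optional[int] = None,
    padding_value: int = 0,
    padding_style: str = "post",
) -> List[List[int]]:
    """Pad every sequence to max_length with padding_value (sketch)."""
    if padding_style not in ("pre", "post"):
        raise ValueError("padding_style must be 'pre' or 'post'")
    if max_length is None:
        # Assumption: default to the length of the longest sequence.
        max_length = max((len(s) for s in sequences), default=0)
    padded = []
    for seq in sequences:
        kept = list(seq[:max_length])  # assumption: keep the first items
        filler = [padding_value] * (max_length - len(kept))
        padded.append(filler + kept if padding_style == "pre" else kept + filler)
    return padded


print(pad_rows([[1, 2, 3], [4]]))
print(pad_rows([[1, 2, 3], [4]], max_length=2, padding_style="pre"))
```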

nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})[source]

module

Check if given directory exists, create if not.

Parameters: dir_path (str) – path to directory

, max_size=None)[source]

, sourcefile, destfile, totalsz=None)[source]

Download the file specified by the given URL.

Parameters:
  • url (str) – url to download from
  • sourcefile (str) – file to download from url
  • destfile (str) – save path
  • totalsz (int, optional) – total size of file

str, sourcefile: str, unzipped_path: str, license_msg: str = None)[source]

Downloads a zip file, extracts it to the destination, and deletes the zip file. If license_msg is supplied, the user is prompted for download confirmation.

Transform string to GZIP coding

Parameters: g_str (str) – string of data
Returns: GZIP bytes data
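A minimal sketch of the string-to-GZIP transformation using Python's standard gzip module. The helper name `to_gzip_bytes` is hypothetical, and encoding the string as UTF-8 before compressing is an assumption.

```python
import gzip


def to_gzip_bytes(g_str: str) -> bytes:
    """Encode the string as UTF-8 and compress it with GZIP (sketch)."""
    return gzip.compress(g_str.encode("utf-8"))


payload = to_gzip_bytes("hello world")
# Decompressing recovers the original text.
print(gzip.decompress(payload).decode("utf-8"))
```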

For objects that have members that can't be serialized and implement a toJson() method.

Utility function for getting number of lines in a text file.

, extension='txt')[source]

Load all files from given directory (with given extension).

Load a file into a json object.

str, overwrite_output_dir: str)[source]

Create output directory, or throw an error if it exists and overwrite_output_dir is false.

str, outpath='.')[source]

Unzip a file to the same location as filepath. The decompression algorithm is chosen by file extension.

Parameters:
  • filepath (str) – path to file
  • outpath (str) – path to extract to

, *args)[source]

Helper to validate passed path directory and append any subsequent filename arguments.

Parameters:
  • path (str) – Initial filesystem path. Should expand to a valid directory.
  • *args (list, optional) – Any filename or path suffixes to append to path for returning.

Returns: (list, str): path prepended list of files from args, or path alone if no args specified.

Raises: ValueError – if path is not a valid directory on this filesystem.

*args)[source]

Validate all arguments are of correct type and in correct range.

Parameters: *args (tuple of tuples) – Each tuple represents an argument validation like so:
  • Option 1 - With range check: (arg, class, min_val, max_val)
  • Option 2 - Without range check: (arg, class)

If class is a tuple of type objects, check if arg is an instance of any of the types. To allow a None valued argument, include type(None) in class. To disable lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as string), range checks are performed on its length.
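The tuple-based validation scheme above can be sketched as follows. The function name `check_args` is hypothetical, the exception types (TypeError for a type mismatch, ValueError for a range violation) are assumptions, and checking length-bearing arguments such as strings against the bounds by their length is likewise an assumption of this sketch.

```python
from typing import Any, Tuple


def check_args(*args: Tuple[Any, ...]) -> None:
    """Validate (arg, class) or (arg, class, min_val, max_val) tuples (sketch)."""
    for spec in args:
        if len(spec) not in (2, 4):
            raise ValueError("each validation tuple needs 2 or 4 items")
        arg, arg_class = spec[0], spec[1]
        if not isinstance(arg, arg_class):
            raise TypeError(f"{arg!r} is not an instance of {arg_class!r}")
        if len(spec) == 2 or arg is None:
            continue  # no range check requested, or nothing to measure
        min_val, max_val = spec[2], spec[3]
        # Assumption: objects with a length are range-checked on len(arg).
        value = len(arg) if hasattr(arg, "__len__") else arg
        if min_val is not None and value < min_val:
            raise ValueError(f"{arg!r} is below the minimum {min_val}")
        if max_val is not None and value > max_val:
            raise ValueError(f"{arg!r} is above the maximum {max_val}")


# A range-checked int, a string with only a lower bound, and an optional value.
check_args((5, int, 0, 10), ("abc", str, 1, None), (None, (str, type(None))))
print("all arguments valid")
```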

Validates an input argument of type boolean

Validates an input argument is a path string to an existing directory.

Validates an input argument is a path string to an existing file.

Validates an input argument is a path string to an existing file or directory.

Validates an input argument is a path string, and its parent directory exists.

Validates an input argument is a valid proxy path or None

, verbose=False)[source]

Iterates a directory’s text files and their contents.

str)[source]

List the files inside a given zip file

Parameters: filepath (str) – path to file
Returns: String list of filenames
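Listing the members of a zip archive can be done with Python's standard zipfile module, as in the sketch below. The helper name `list_zip_contents` is hypothetical, and the temporary archive exists only to make the example self-contained.

```python
import os
import tempfile
import zipfile
from typing import List


def list_zip_contents(filepath: str) -> List[str]:
    """Return the names of the files stored in a zip archive (sketch)."""
    with zipfile.ZipFile(filepath, "r") as archive:
        return archive.namelist()


# Build a small archive in a temporary directory, then list it.
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("a.txt", "first")
        archive.writestr("docs/b.txt", "second")
    names = list_zip_contents(zip_path)

print(names)
```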

nlp_architect.utils.metrics module

nlp_architect.utils.metrics.acc_and_f1(preds, labels)[source]

Return accuracy and F1 score

nlp_architect.utils.metrics.accuracy(preds, labels)[source]

Return simple accuracy in the expected dict format

nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')[source]

Get CoNLL-style scores (precision, recall, F1)

nlp_architect.utils.metrics.pearson_and_spearman(preds, labels)[source]

Get Pearson and Spearman correlation

nlp_architect.utils.metrics.simple_accuracy(preds, labels)[source]

Return simple accuracy
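To make the accuracy and F1 metrics above concrete, here is a dependency-free sketch. The function names are hypothetical, the dict keys "acc" and "f1" are assumptions about the "expected dict format", and the F1 shown is the binary F1 for one positive label; the library may compute or average it differently.

```python
from typing import Dict, Sequence


def simple_accuracy_sketch(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal their labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    correct = sum(1 for p, l in zip(preds, labels) if p == l)
    return correct / len(labels)


def binary_f1_sketch(preds: Sequence[int], labels: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive label."""
    tp = sum(1 for p, l in zip(preds, labels) if p == positive and l == positive)
    fp = sum(1 for p, l in zip(preds, labels) if p == positive and l != positive)
    fn = sum(1 for p, l in zip(preds, labels) if p != positive and l == positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def acc_and_f1_sketch(preds: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Bundle both metrics in a dict (key names are assumptions)."""
    return {
        "acc": simple_accuracy_sketch(preds, labels),
        "f1": binary_f1_sketch(preds, labels),
    }


scores = acc_and_f1_sketch(preds=[1, 0, 1, 1], labels=[1, 0, 0, 1])
print(scores)
```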

nlp_architect.utils.metrics.tagging(preds, labels)[source]

nlp_architect.utils.string_utils module

class nlp_architect.utils.string_utils.StringUtils[source]

Bases: object

determiners = []
static find_head_lemma_pos_ner(x: str)[source]

Parameters: x – mention
Returns: the head word and the head word lemma of the mention
static is_determiner(in_str: str) → bool[source]
static is_preposition(in_str: str) → bool[source]
static is_pronoun(in_str: str) → bool[source]
static is_stop(token: str) → bool[source]
static normalize_str(in_str: str) → str[source]
static normalize_string_list(str_list: str) → List[str][source]
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []

nlp_architect.utils.testing module

class nlp_architect.utils.testing.NLPArchitectTestCase(methodName='runTest')[source]



Hook method for setting up the test fixture before exercising it.


Hook method for deconstructing the test fixture after testing it.

nlp_architect.utils.text module

class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True)[source]

Bases: object

Spacy pipeline wrapper which prompts user for model download authorization.

Parameters:
  • model (str, optional) – spacy model name (default: english small model)
  • disable (list of string, optional) – pipeline annotators to disable (default: [])
  • display_prompt (bool, optional) – flag to display/skip license prompt

return Spacy’s instance parser

tokenize(text: str) → List[str][source]

Tokenize a sentence into tokens

Parameters: text (str) – text to tokenize
Returns: a list of str tokens of input
Return type: list
class nlp_architect.utils.text.Stopwords[source]

Bases: object

Stop words list class.

static get_words()[source]
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)[source]

Bases: object

A vocabulary that maps words to ints (storing a vocabulary)


Add word to vocabulary

Parameters: word (str) – word to add
Returns: id of added word
Return type: int

Adds an offset to the ints of the vocabulary

Parameters: offset (int) – an int offset

Word-id to word (string)

Parameters: wid (int) – word id
Returns: string of given word id
Return type: str

Return the vocabulary as a reversed dict object

Returns: reversed vocabulary object
Return type: dict

Get the dict object of the vocabulary

Get the word_id of given word

Parameters: word (str) – word from vocabulary
Returns: int id of word
Return type: int
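The word-to-int behaviours described for the Vocabulary class can be sketched with a small stand-in class. Everything here is illustrative: the class name `TinyVocabulary` and all of its method names are hypothetical (the method names are not shown in this section), and the include_oov option is left out.

```python
from typing import Dict, Optional


class TinyVocabulary:
    """Maps words to consecutive ints starting at `start` (illustrative only)."""

    def __init__(self, start: int = 0) -> None:
        self._vocab: Dict[str, int] = {}
        self._next_id = start

    def add(self, word: str) -> int:
        """Add a word if it is new and return its id."""
        if word not in self._vocab:
            self._vocab[word] = self._next_id
            self._next_id += 1
        return self._vocab[word]

    def word_id(self, word: str) -> Optional[int]:
        """Return the id of a word, or None if it is unknown."""
        return self._vocab.get(word)

    def reversed(self) -> Dict[int, str]:
        """Return the vocabulary as an id -> word dict."""
        return {wid: word for word, wid in self._vocab.items()}

    def id_to_word(self, wid: int) -> Optional[str]:
        """Return the word stored under an id, or None if it is unknown."""
        return self.reversed().get(wid)

    def add_offset(self, offset: int) -> None:
        """Shift every stored id (and the next free id) by offset."""
        self._vocab = {word: wid + offset for word, wid in self._vocab.items()}
        self._next_id += offset


vocab = TinyVocabulary(start=1)
for token in ["the", "cat", "the"]:
    vocab.add(token)
print(vocab.reversed())
```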
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]][source]

Convert BIO tagged list of strings into span starts and ends

Parameters:
  • text – list of words
  • tags – list of tags

Returns: list of start, end and tag of detected spans
Return type: tuple
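A minimal sketch of BIO-to-span conversion. The section does not state which offset convention the library uses, so this sketch assumes token indices with an exclusive end; the library may report different offsets (for example character positions). The function name `bio_spans_sketch` is hypothetical.

```python
from typing import List, Optional, Tuple


def bio_spans_sketch(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) token spans, end exclusive (assumed convention)."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))  # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new span
    if start is not None:
        spans.append((start, len(tags), label))  # span running to the end
    return spans


text = ["John", "Smith", "visited", "Paris", "Intel", "Labs"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
spans = bio_spans_sketch(tags)
print([(" ".join(text[s:e]), label) for s, e, label in spans])
```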

Return int id of given character. OOV char = len(all_letter) + 1

Parameters: c (str) – string character
Returns: int value of given char
Return type: int
nlp_architect.utils.text.character_vector_generator(data, start=0)[source]

Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • start (int, optional) – vocabulary index start integer

Returns: a 2D numpy array, and the constructed Vocabulary


nlp_architect.utils.text.extract_nps(annotation_list, text=None)[source]

Extract Noun Phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.

Parameters:
  • annotation_list (list) – a list of annotation tags in str
  • text (list, optional) – a list of token texts in str

Returns: list of start/end markers of noun phrases; if text is provided, a list of noun phrase texts


return character of given char id

nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)[source]

Read a tab separated sequential tagging file. Returns a list of list of tuple of tags (sentences, words)

Parameters:
  • file_path (str) – input file path
  • ignore_line_patterns (list, optional) – list of string patterns to ignore

Returns: list of list of tuples
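The reading step above can be sketched as follows, assuming a CoNLL-style layout in which each line holds tab-separated fields for one word and blank lines separate sentences. The helper name `read_tagged_lines` is hypothetical, and treating the ignore patterns as regular expressions is an assumption.

```python
import io
import re
from typing import Iterable, List, Optional, Tuple


def read_tagged_lines(
    lines: Iterable[str], ignore_line_patterns: Optional[List[str]] = None
) -> List[List[Tuple[str, ...]]]:
    """Group tab-separated lines into sentences of field tuples (sketch)."""
    patterns = [re.compile(p) for p in (ignore_line_patterns or [])]
    sentences: List[List[Tuple[str, ...]]] = []
    current: List[Tuple[str, ...]] = []
    for raw in lines:
        line = raw.strip()
        if any(p.search(line) for p in patterns):
            continue  # skip lines matching an ignore pattern
        if not line:
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    if current:
        sentences.append(current)
    return sentences


sample = io.StringIO("-DOCSTART-\tO\n\nJohn\tB-PER\nsmiled\tO\n\nParis\tB-LOC\n")
parsed = read_tagged_lines(sample, ignore_line_patterns=["-DOCSTART-"])
print(parsed)
```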


Simple text normalizer. Runs each token of a phrase through a WordNet lemmatizer and a stemmer.

nlp_architect.utils.text.spacy_normalizer(text, lemma=None)[source]

Simple text normalizer using spacy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.

Parameters:
  • text (string) – the text to normalize.
  • lemma (string) – lemma of the given text. In this case only the stemmer will run.

nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)[source]

Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • lower (bool, optional) – transform strings into lower case
  • start (int, optional) – vocabulary index start integer

Returns: 2D numpy array and Vocabulary of the detected words
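The transformation above can be sketched with plain Python containers. In this illustration nested lists stand in for the 2D numpy array, a dict stands in for the Vocabulary object, and `words_to_ids` is a hypothetical name; assigning ids in order of first appearance is an assumption.

```python
from typing import Dict, List, Tuple


def words_to_ids(
    data: List[List[str]], lower: bool = False, start: int = 0
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Map each word to an int id and return the id vectors plus the vocabulary."""
    vocab: Dict[str, int] = {}
    vectors: List[List[int]] = []
    for sentence in data:
        ids = []
        for word in sentence:
            key = word.lower() if lower else word
            if key not in vocab:
                vocab[key] = start + len(vocab)  # ids follow first appearance
            ids.append(vocab[key])
        vectors.append(ids)
    return vectors, vocab


vectors, word_vocab = words_to_ids([["The", "cat"], ["the", "dog"]], lower=True, start=1)
print(vectors)
print(word_vocab)
```

Starting the ids at 1 leaves 0 free for a padding value, which is one common reason for a configurable start index.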

Module contents